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Abstract — The Kernel Polynomial Method (KPM) is one 
of the fast diagonalization methods used for simulations of 
quantum systems in research fields of condensed matter physics 
and chemistry. The algorithm has a difficulty to be parallelized 
on a cluster computer or a supercomputer due to the fine-gain 
recursive calculations. This paper proposes an implementation 
of the KPM on the recent graphics processing units (GPU) 
where the recursive calculations are able to be parallelized in 
the massively parallel environment. This paper also illustrates 
performance evaluations regarding the cases when the actual 
simulation parameters are applied, the one for increased 
intensive calculations and the one for increased amount of 
memory usage. Finally, it concludes that the performance on 
GPU promises very high performance compared to the one on 
CPU and reduces the overall simulation time. 

Keywords-GPGPl], Kernel Polynomial Method, Condensed 
Matter Physics, CUDA 

I. Introduction 

Today's technological achievement in our everyday life 
is based on years of fundamental research for a wide 
variety of materials with fascinating functionalities such as 
semiconductors, magnets, and superconductors. Researchers 
in condensed matter physics revealed long ago that those dif- 
ferent properties of materials result from different behaviors 
of electrons, which are described by quantum mechanical 
equation of motion. Although it has been more than 80 
years since quantum mechanics was established, there are 
still many properties of matters whose origins are yet to 
be understood. Such examples include copper based high 
temperature superconductors (H] and some of magnetic 
insulators of organic compounds f2\. The common feature 
of these systems is a strong quantum correlation between 
electrons, which is turned out to be crucial for determining 
their properties. It is precisely this strong correlation that 
makes it difficult to treat these systems analytically without 
introducing any bias in theory. 

The best way to treat the strong quantum correlations is 
to solve quantum mechanical equation of motion numeri- 



cally exactly. Because of exponential increase of degrees of 
freedom with the number of electrons ~ 0(10^"^), we must 
still resort some sort of approximations. However, unlike 
analytical treatments, numerical simulations can handle the 
strong correlation effects with controllable approximations. 
Among many, well established numerical methods thus far 
are exact diagonalization method [31, [4|, quantum Monte 
Carlo method fS), density-matrix renormalization group 
method f6l, f?!, ID, Q, and kernel polynomial method 
(KPM) |10|. Each method is suited to particular sets of 
problems and at the same time each has severe limitations. 
For instance, the exact diagonalization method is able to 
evaluate the ground state (and low energy excited states) in 
high accuracy, but it is limited to a small size of systems. 

The simulation evaluates various physical quantities such 
as density of states (DoS) and Green's functions for elec- 
trons, which are necessary to study electronic structures. In 
particular, a straightforward method to calculate the DoS 
by diagonalizing a Hamiltonian matrix requires computa- 
tional complexity 0{D^), where D is the system size. 
This complexity is a performance bottleneck to evaluate 
higher energy excited states. In this respect, the KPM has 
an exceptional advantage because the KPM reduces the 
complexity of diagonalization to 0{D) at most by truncating 
polynomial expansions, which in turn controls the accuracy 
of the approximation. Thus, this paper focuses on the KPM 
which appropriately evaluates the DoS and Green's function 
including higher energy excited states ifTOl . 

The KPM is an approximation method based on poly- 
nomial expansions from which physical quantities are eval- 
uated. In particular, the Chebyshev expansion is the most 
common and useful polynomial to be applied. To avoid 
the Gibbs phenomenon due to truncated polynomial expan- 
sions with a finite order, modified kernel polynomials are 
preferably used. For example, the Dirac's delta function is 
well approximated by truncating Chebyshev expansion with 
the Jackson kernel ifTol . Moreover, in quantum statistical 



mechanics, it is required to evaluate the trace of large- 
dimensional Hamiltonian matrices. This trace is efficiently 
approximated by using random vectors [10] (we call it 
"stochastic trace method" in this paper). Therefore, combin- 
ing these two methods, truncated polynomial expansions and 
random vector bases, allows us to evaluate the DoS and other 
physical quantities with significantly reduced complexity. 

The computational cost inevitably increases with system 
sizes considered, and with the number of polynomials kept 
and random vectors generated to meet the desired accuracy. 
It is therefore expected to reduce the simulation latency 
drastically by implementing the KPM in parallel platform. 

Regarding computer hardware, the graphics processing 
units (GPU) have become available to be used for accel- 
eration platform as a substitute of CPU. This is due to the 
recent drastic performance growth of GPU. The recent GPU 
has already achieved the performance up to TFLOPS order 
Therefore, it is applied to various scientific fields to solve 
the grand challenge applications under a personal computing 
environment IfTTI . 

The program on GPU is called stream-based program 
which processes each data unit contained in input data 
streams, and generates the corresponding data unit forming 
output data streams. This computing style has benefits of 
1) eliminating memory access bottleneck, which is seen in 
the von Neumann style architecture, and 2) data parallelism 
because each data unit does not have any dependency in 
the data streams. The recent challenges to speedup intensive 
computations enforce algorithms to be redesigned to fit 
to GPU and to receive the benefit of especially the data 
parallelism characteristics assigning small operations to the 
enormous number of the stream processors. This computing 
style would become a typical computing style in the next 
supercomputing generation. 

This paper focuses on a GPU-based implementation of 
the KPM applying the stream-based computing style. We 
propose an effective implementation of the KPM on GPU 
to accelerate its performance faster than the recent CPU. As 
seen in the next section, vectors (higher order polynomials) 
are generated recursively. This characteristic is suffered to 
parallelize the KPM effectively in a CPU-based large system. 
Applying GPU resources and a stream-based programming 
style, this paper will challenge to overcome the performance 
limitation caused by the recursive operation. 

This paper is organized as follows. Section describes 
the detailed explanation of KPM and the overview of the 
general purpose computing on GPU. Section HIl] proposes 
the design and implementation of KPM on GPU. Section lTVl 
analyzes the performances of typical sets of input parameters 
used in condensed matter physics and discusses the program 
behaviors when the parameters change to increase resource 
usage regarding processor and memory. Finally, section |V] 
concludes this paper. 



II. Background and definitions 

A. Kernel polynomial method 

1 ) Definition: The basis of KPM is the following (Cheby- 
shev) polynomial expansion of a function f{x) defined in 
[-1,1], 
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and Tn (x) is the Chebyshev polynomial defined as 

Tn{x) = COS [narccos(x)] . 



(1) 



(2) 



(3) 



It should be mentioned that the Chebyshev polynomials 
satisfies the following recursion relations. 



Tq{x) = 1 , Ti{x) = X , 

Tn+2{x) = 2xTn+i{x) - T„(x) 



KPM is defined as 
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where the additional coefficients g„ given by a kernel which 
satisfies the limit 
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where || • || is suitable well-defined norm. 

2} Application to quantum systems: In quantum physics, 
we need to expand functions of the Hamiltonian matrix. In 
this paper, we focus on the density of state (DoS). Then, we 
show an example of application of KPM for calculation of 
DoS. 

We consider the system described by the Hamiltonian 
matrix H. First, we apply the following linear transformation 
in order to fit the spectrum of H to [—1, 1], 



where 



H ^ (H -a+)/a-, 



Q^it — (-^uppcr i -^lowcr)/2 , 
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The parameters -Euppci and -Eiowoi are the upper and lower 
limits of the eigenvalues of H obtained by the Gerschgorin 
theorem. 

The density of state (DoS) p{oj) of the I?-dimensional 
Hamiltonian matrix H is defined by 
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where Ek is the k-th eigenvalue and S{x) is the delta 
function. We apply the linear transformation (|8) and obtain 
the equation 

1 



Then /i„ is expressed by this expression as 
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where 



U) = {lu — Q;+)/a_ 
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In order to obtain the approximated DoS using KPM, the 
coefficients fin in this case is obtained as 

/x„ = y" (ia)p(w)T„(cZ>) 
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where |fc) is the fc-th eigenvector and (fc| = 

3) Stochastic evaluation of traces: In order to evaluate 
the trace in Eq.(fT3]l. we introduce the stochastic evaluation 
method of traces, which estimates /i„ by average over only 
a small number R D of randomly chosen vector 

First, we introduce an arbitrary basis {\i)} a set of inde- 
pendent identically distributed random variables {£,r,i\£,r.i G 
M} which in terms of the statistical average ((•)) fulfill 

((^r.»=0, iiCrAr'.')) ^Srr'Su^ , (14) 

a random vector is defined through 

D-l 
i=0 

Using them, we can approximately evaluate the trace as 
follows. 
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In order to make (r|T„(iJ)|r), we use the following 
recursive relations for the vectors |r„) := T„(i?)|r) derived 
from the relations (|4) and (|5]l, 

|ro) = |r), |ri)=i/|ro), (17) 
|r„+2) = 2iJ|r„+i) - |r„) . (18) 
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4) Numerical complexity: The numerical complexity of 
the KPM is 0{SRND) if the H is sparse matrix, where 
S is the number of the realization of the set of random 
variables {Cr,i}- The process costing 0{D) is the making 
part of |r„) shown in Eq. (fTSl l. which is the heaviest part 
in KPM. When the H is considered as a dense matrix, the 
complexity of the part becomes 0(_D^). The 0{SR) comes 
from the average and summation in Eq. ([19) and 0{N) from 
the summation in Eq. (|6]l. This numerical cost 0{SRND) 
is very effective against the full diagonalization which costs 
0{D^) if S,R,N < D"^, and the if is a sparse matrix. 
However, it is a dense matrix, the numerical cost becomes 
0{SRN D^) due to all multiplications for all elements in 
the H and the |r„) must be performed straightly without 
considering the CRS (Compressed Row Storage) format for 
a sparse matrix. This paper considers the simple case when 
the CRS format is not applied to the memory maintenance 
for the H. Therefore, all the elements in the H matrix are 
applied to all the calculations in the KPM. 

B. General purpose computing on GPUs 

1) GPU architecture: A video adapter that includes a 
GPU and a Video RAM (VRAM) is connected to a CPU's 
peripheral bus such as PCI Express. The video adapter 
works as a peripheral device of the CPU, and its GPU is 
controlled by the CPU to help a part of visualization tasks 
in the system. To utilize the GPU as a computing resource 
for GPGPU applications, the CPU downloads application 
program to the GPU's instruction memory and also prepares 
input data for the program. The program fetches the data 
and generates the result to the memory areas. The GPU 
reads/writes the VRAM directly to execute the calculation 
for the program. In this case, the original data is prepared in 
the main memory. The CPU copies the data to the VRAM. 
During the execution of the program, the GPU generates the 
results to the VRAM. The CPU copies the results from the 
VRAM to the main memory. 

The recent GPUs have only a kind of processor called 
the stream processor. The processor works for general 
purpose processes in any kind of calculation. However, 
the computing style must be followed in the stream-based 
one distributing elements included in streams into multiple 
stream processors. GPU uses two types of memory called 
global and shared memories. The global memory is provided 
by the memory placed outside of GPU such as DDR3 
VRAM. The shared memory is placed besides of the stream 
processor that works as if a cache. 

2) CUDA: The Compute Unified Device Architecture 
(CUD A) has been proposed by NVIDIA corporation [,12J . 
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Figure 1. A GPU architecture. 



The tools and APIs for programming on CUDA environment 
is now provided by the company's website. 

The CUDA assumes an architecture model as illustrated in 
Figure |2] (a). The model defines a GPU which is connected 
to a CPU's peripheral bus. A VRAM (the global memory) 
that maintains data used for calculation on the GPU is 
connected to the GPU. The data is copied from the host 
memory before the CPU commands to execute a program 
on the GPU. The program is executed as a thread in a 
thread block. The thread blocks are tiled in a matrix of from 
one to three dimensions. In the figure, thread blocks are 
tiled in two dimensions which size is Ugrid x mgrid- Each 
thread block has multiple threads in a matrix which size is 
varied from one to three dimensions. The figure also shows 
a thread block that includes nuock x "mbiock threads. Each 
thread block has individual shared memory space where 
shared valuables accessed among threads in the block are 
stored temporally. Thus, the program targeted to GPU in the 
CUDA environment is invoked as threads. The threads are 
grouped by the unit of the thread block. Therefore, obtaining 
a large parallelism, a large number of threads are invoked 
concurrently. 

In the program on the CUDA environment, the threads 
are described as a stream-based function written in C called 
a kernel function as shown in Figure|2](b). The program has 
two parts of the codes targeted to CPU and GPU, which is 
initially invoked by the CPU; a main program for CPU and 
a kernel function called as the thread on GPU. The kernel 

function is defined with the global directive so that 

it is executed on GPU. In the function, the global variables 
named gridDim, blockDim, blockldx, threadldx, 
implicitly declared by the CUDA runtime, are available to be 
used to specify the size of the grid and the thread block, the 
indices of the thread block and of the thread respectively. For 
example, using these global valuables. Figure |2](b) performs 
a summation of arrays A and B assigning each summation of 



Memory 



CPU 



Thread 



< 

a. 
> 



( ) Peripheral bus ( ) 
GPU :a E 



Thread 
Block 



Thread 
Block 



Thread 
Block 



Thread 
Block 



Thread 
Block 



Thread 
Block 



Thread 
Block 



-■moat 

(a) Execution model of kernel threads in CUDA 

int main(){ 
float *A, *B, *C; 

dim3dimBlock(OT„„,,4, 

dim3 dimGrid( m^^^ , n^„j); 

KernelFunc«< dimGrid , dimBlock »>(A, B, C); 

} 

global void KernelFunc(float *a, float *b, float *c){ 

int i = blockDim. X * blockDim. y * 

(gridDim. X * blockldx.y + blockldx.x) + threadldx.x; 
c[i] = a[i] + b[i]; 

(b) Example CUDA code for array summation 

Figure 2. CUDA programming environment. 



} 



the elements in those arrays to a thread and returns the result 
to the array C. The function is called by the main program 
specifying the sizes of the grid and the thread block with 
<<< >>>. Finally, reading data from the VRAM transferred 
by the main program, the kernel function is assigned to GPU, 
and runs as multiple threads. Thus, because programmer can 
just simply consider the stream-based kernel function and 
the calling code for the function in the main program, using 
the conventional C language manner, the CUDA provides an 
easy and transparent interface for GPGPU. 

According to the backgrounds we have mentioned above, 
it is important for the simulation in the quantum physics 
to apply a fast diagonalization method to reach the goal of 
the simulation quickly. However, the KPM has difficulty of 
fine-grain parallelization in large scale computers such as 
cluster computers or supercomputers due to the recursive 
calculation performed in the Eq. (fT¥t . Therefore, it is worth 
for us to implement the KPM on a GPU where the massively 
parallel environment is equipped with a large number of 
stream processors. Thus, this paper focuses on design and 
implementation of the KPM on GPU that challenges to 
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Figure 3. Design and implementaion of KPM on GPUs. 



achieve the advanced performance with applying the stream- 
based computing style on the massively parallel environ- 
ment. 

III. Kernel polynomial method applying GPUs 
A. Design for massively parallel platform 

Figure |3] summarizes the KPM algorithm. The step (1) 
generates randomly a vector ~^ that the number of elements 
is H_SIZE (this equals to the D in section UTAb . The step 
(2) gets ^„ from ^n_i and 1^,1-2 recursively calculating a 
matrix multiply of H and 'T'n-i in the step (2.1). This mul- 
tiplication is very hard to parallelize using MPI or OpenMP 
because of the dependencies due to the recursive iteration 
although the part needs the most intensive calculation. Then 
a dot product is calculated using ^„ again with ^ at the 
step (2.2) and generates /i„. Then the generation of the /i„ 
is iterated for RS times. This means each generation of /i„ 
can be massively parallelized on GPUs. Finally, the average 
of all the /i„s is generated at the step (3). N /^„s are finally 
generated from the RS-time iterations of the step (1) and (2). 
This generation of the moments achieves the objective of the 
KPM. This summation to generate /i„ can be parallelized 
on GPU. Therefore, implemented on GPUs, two parallel 
processing parts are entirely performed during the evaluation 
of the moments using KPM: a) generation of /i„ and b) 
generation of /i„. The maximum number of parallelism at 
the both a) and b) parts becomes the SR because the total 
number of threads executed in the stream processors is SR. 
Here, GPU has an architectural restriction to the number 
of threads in a thread block referred as BLOCK_SIZE in 
this paper Therefore, the number of thread blocks becomes 
RS/BLOCK_SIZE. Considering the paralleUzation tech- 
niques above, let us explain the implementation of a kernel 
program on CUDA that invokes both the a) and b) parts. 
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Figure 4. Parallelization of KPM. 



B. Implementation 

We have implemented a kernel program for GPU using 
CUDA. The kernel receives the H_SIZE, N that is the 
number of moments and RS as the arguments. All calcula- 
tions are performed based on double precision. The kernel 
includes two important concepts. One is how to keep high 
parallelism. Another is an effective memory management for 
the parallelism. 

1) Parallelization of calculations: As we discussed in the 
last section, two heavy calculation parts (the a) and the b)) 
should be parallelized and it would give the largest impact 
for the speedup. 

Figure |4] a) shows the generation part for the l^n- 
needs ^„-i, and 1^ that is randomly generated. 

These four vectors are obtained in the global memory and 
each block will write those vectors swapping the pointers. 
Here the number of blocks is RS/BLOCK_SIZE. In each 
block, BLOCK _SI ZE stream processors are concurrently 
working to generate a part of those vectors including the 
random number generation for 1^ . Therefore, this part 
will be fully parallelized into the total number of stream 
processors equipped on GPUs. This part will generate /2i, 



)l2 ... jj-N using 7" and ^„ for N time iteration. 

Figure |4] b) depicts the parallelization of generation for 
It performs just parallel summations for generating a 
scalar fin where all blocks works in parallel. 

2) Memory consumption: Let us consider the required 
memory amount for the operations in Figure 2] in the case 
of double precision. For the operation a), because four 
7^ vectors are stored in the global memory for a block. 
Each vector has H_SIZE elements. Therefore, this 
part consumes Number of Blocks x 4 x H_SIZE x 8 
bytes. The operation b) is parallelized into the number of 
blocks. Each block performs a part of summation using N 
/is. The length of /is is H_SIZE. Therefore, it needs totally 
Number of Blocks x N x H_SIZE x 8 bytes. 

The operation a) writes /t„ into the global memory. This 
needs to be kept with 1^ vectors simultaneously. Therefore, 
the total number of memory is Number of Blocks x 
H_SIZE X (8 X iV + 32). 

Due to the recursive relationships among 7^„, ~P'n-i 
and ~T'n-2, the KPM is treated generally as one of very 
hard parallelized algorithms. However, as we can see in this 
section, on the GPU, a massively parallel environment, the 
KPM is fully parallelized due to the stream-based computing 
concept. Thus, we can expect an effective speedup that will 
be proportional to the number of stream-processors. 

IV. Experimental performance analysis 

This section shows performance evaluations of the KPM 
implemented on GPU. The performance based on GPU is 
compared with the one based on CPU. The experimental 
environment is a PC that consists of an Intel's Core i7 930 
processor at 2.80GHz with 12GB DDR3 memory, and the 
NVIDIA Tesla C2050 with 3GB Memory is connected to 
the PCI Express bus. The configuration of the cache in the 
GPU is set to 16KB and the shared memory size is 48KB. 
The OS of the PC is the Cent OS of the Linux Kernel 2.6. 18. 
The driver version of the GPU is 3.0. All KPM calculations 
are performed with double precision floating point. The CPU 
version is compiled with GCC 4.4. 1 with 03 option. 

We perform three kinds of performance analysis: (1) 
evaluation using actual sets of parameters, (2) the one with 
increasing calculation size and (3) the one with increasing 
memory usage. The first evaluation hires sets of parameters 
used in actual simulations of the meaningful model applied 
to the condensed matter physics field. The second evalu- 
ation analyses the behavior of the performances when the 
parameter N is increased. This means that more intensive 
calculation is loaded to the CPU and the GPU following the 
increase of N. The last evaluation shows the performance 
impacts when the H_SIZE is increased. This case needs 
the square sized memory to store the H matrix that is 
increased by the impact of H_SIZE^. 
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Figure 6. The DoS comparison with trancations between N=256 and 
N=512 when the lattice is made of cubes placed in 10x10x10, R=14 and 
S=128. 



A. Performance analysis using actual simulation parameters 

In the field of the computational condensed matter 
physics, the KPM is applied to a simulation to evaluate the 
DoS in a three dimensional lattice model. Let us consider a 
lattice model made of cubes in 10 x lOx 10 where an electron 
is placed in each corner This model needs a Hamiltonian 
matrix sized in 1000 x 1000 due to the presentations of cor- 
relations among the electrons at each corner The significant 
characteristics of the matrix include that 1) it is sparse and 
symmetric and 2) any row contains seven non-zero elements 
with the condition where all diagonal ones are zeros and the 
other non-zero ones are —Is. 

We evaluate the DoS in the case of the lattice that we 
assumed above using the fixed parameters of the KPM with 

= 14 and = 128. Varying N from 128 to 1024 in 
the steps of 2", Figure |5] shows the execution times and the 
speedup comparing the performances on the CPU with the 
ones on the GPU. The speedup keeps 3.5 times for all the 
cases. This means that the simulation can be accelerated by 
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the GPU and the execution time becomes almost 40% faster 
than the one on CPU at most. 

We shall pickup two DoS data combinations from the 
parameter sets of Figure |5] and plot it to a graph as depicted 
in Figure |6] The graph shows the DoS when iV = 256 and 
N — 512. When iV is the smaller number, the truncation 
reduces to the resolution of the DoS. However, the process- 
ing time is smaller than the case of a large N . Therefore, 
although the case of N — 512 shows higher resolution of 
the DoS, it takes longer calculation time. 

B. Performance analysis with increased intensive calcula- 
tions 

Obtaining the fixed parameters of H_SIZE = 128, 
i? = 14 and S — 128, we measure the performances 
with varying the from 128 to 2048. The graph of the 
performances is illustrated in Figure |7] The graph shows 
the execution times with bars and the speedups (i.e. the 
CPU time is divided by the corresponding GPU time) with 
a line. As increasing the N, that is, as increasing the 
calculation amount, the speedup increases to almost 4 times. 
This means that the performance with the higher intensive 
calculations affected by the larger N causes higher effective 
data parallelism on GPU when the calculation amount is 
increased without changing the size of the memory usage. 
Thus, our implementation on GPU clearly achieves higher 
performance than the CPU-based KPM as increasing the 
calculation amount. 

C. Performance analysis with increased memory usage 

This analysis fixes N = 128, i? = 14 and S* = 128. 
We vary H_SIZE from 512 to 4096 with the step of 2". 
The performance presents effects caused by increasing the 
memory usage. The graph of the performance is depicted in 
Figure [8] When the amount of memory usage increases, the 
number of memory accesses increases. Therefore, the CPU 
version needs to read/write the memory as increased the 
size of H matrix. On the other hand, because the GPU can 



cache a part of the matrix into very fast shared memory and 
accesses the memory in the stream-based manner Thus, the 
execution time of the GPU version does not increase more 
than the complexity (0{H _SI ZE^)). This causes almost 
four times faster performance than the CPU version. 

As we discussed in three kinds of evaluations above, 
the performances on GPU achieve better performances than 
the ones on CPU due to the highly parallelism caused by 
the GPU-based implementation explained in this paper. The 
implementation achieves the advanced performance even if 
it is applied to the actual examples from the condensed 
matter physics or the cases with hard conditions virtually 
when the amounts of the computation and the memory usage 
are increased. Thus, we have confirmed that the KPM is a 
suitable algorithm that fits well to the GPU environment 
and the performance acceleration accomplishes amazingly 
the high performance. 

V. Conclusions 

This paper has proposed an implementation of the KPM 
widely used in the physics and the chemistry field to 
simulate various quantum states. Our GPU version shows 
about 4 times faster than the CPU one. Therefore, using 
a GPU, productivity of the moments for a quantum state 
is accelerated to four times. Therefore, the GPU version is 
expected to be used for various grand challenge simulations 
to find a new quantum state that resolves unknown physical 
theories in the natural phenomenon. 

For the future plans, we are considering to quest a method 
to find the best block size used in the GPU that defines 
the size of the stream processors' block. Moreover, the 
parallelization of the KPM on a message passing and a 
shared memory paradigm is also challenging because the 
recursive reference to get 7"„ becomes a bottleneck to be 
parallelized in fine-grain. Moreover, we are also planning to 
extend the GPU-based implementation to a GPU cluster for 
its parallelization. 
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